Overview

Dataset Statistics

Number of Variables 5
Number of Rows 387036
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 241.4 MB
Average Row Size in Memory 654.0 B
Variable Types
  • Categorical: 4
  • Numerical: 1
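
The overview figures above are plain counts over the table. As a rough sketch of how such counts can be derived with the Python standard library (the `overview` helper and the toy rows are illustrative assumptions, not the tool that produced this report):

```python
from typing import Optional, Sequence

Row = Sequence[Optional[object]]


def overview(rows: Sequence[Row]) -> dict:
    """Summarise a table given as equal-length rows; None marks a missing cell."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    total_cells = n_rows * n_cols
    missing = sum(1 for row in rows for cell in row if cell is None)

    # A row counts as a duplicate when an identical row appeared earlier.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    return {
        "variables": n_cols,
        "rows": n_rows,
        "missing_cells": missing,
        "missing_cells_pct": 100.0 * missing / total_cells if total_cells else 0.0,
        "duplicate_rows": duplicates,
        "duplicate_rows_pct": 100.0 * duplicates / n_rows if n_rows else 0.0,
    }


# Toy rows (hypothetical, not taken from the dataset).
toy_rows = [("a", 1), ("a", 1), ("b", None)]
summary = overview(toy_rows)
```

On the toy rows this reports 2 variables, 3 rows, 1 missing cell and 1 duplicate row.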

Dataset Insights

  • score is skewed
  • self_text has a high cardinality: 379303 distinct values
  • created_time has a high cardinality: 362803 distinct values
  • created_time has constant length 19
  • controversiality has constant length 1
  • score has 31894 (8.24%) negatives

Variables


self_text

categorical

Approximate Distinct Count 379303
Approximate Unique (%) 98.0%
Missing 0
Missing (%) 0.0%
Memory Size 172813164 B

Length

Mean 246.0892
Standard Deviation 394.5077
Median 136
Minimum 1
Maximum 9997
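
The Length block summarises the character length of each value. A minimal sketch of that summary follows, assuming the sample standard deviation is intended (the report does not say which estimator it uses); `length_stats` is a hypothetical helper name:

```python
import statistics


def length_stats(values):
    """Return summary statistics of the character lengths of the given strings."""
    lengths = [len(v) for v in values]
    return {
        "mean": statistics.fmean(lengths),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "std": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


# The two untruncated sample rows shown in this report.
stats = length_stats(["Very much indeed.", "Preach on!"])
```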

Sample

1st row Very much indeed.
2nd row The scary thing is...
3rd row What I find terrif...
4th row Preach on!
5th row do you have a link...

Letter

Count 75182539
Lowercase Letter 72506443
Space Separator 15534004
Uppercase Letter 2676096
Dash Punctuation 207538
Decimal Number 672224
  • self_text contains many words: 243779 words
  • The most frequent word (i) is over 1.55 times larger than the second most frequent word (israel)
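
The Letter block counts characters by Unicode general category. The sketch below shows one way to produce such counts with `unicodedata`. It assumes that "Count" is the sum of lowercase and uppercase letters, which matches the figures in this report (72506443 + 2676096 = 75182539) but is not confirmed by the report itself. `char_category_counts` is a hypothetical name.

```python
import unicodedata
from collections import Counter

# Unicode general categories mapped to the labels used in this report.
LABELS = {
    "Ll": "Lowercase Letter",
    "Lu": "Uppercase Letter",
    "Zs": "Space Separator",
    "Pd": "Dash Punctuation",
    "Nd": "Decimal Number",
}


def char_category_counts(values):
    """Count characters per Unicode category across all strings."""
    counts = Counter()
    for text in values:
        for ch in text:
            label = LABELS.get(unicodedata.category(ch))
            if label is not None:
                counts[label] += 1
    result = {label: counts[label] for label in LABELS.values()}
    # Assumption: "Count" means lowercase plus uppercase letters.
    result["Count"] = result["Lowercase Letter"] + result["Uppercase Letter"]
    return result


counts = char_category_counts(["Preach on!"])
```

For the sample row "Preach on!" this gives 7 lowercase letters, 1 uppercase letter and 1 space separator; the exclamation mark falls into a category the report does not list.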

subreddit

categorical

Approximate Distinct Count 14
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 30363789 B
  • The largest value (IsraelPalestine) is over 2.46 times larger than the second largest value (worldnews)

Length

Mean 13.4521
Standard Deviation 2.8994
Median 14
Minimum 9
Maximum 21

Sample

1st row TerrifyingAsFuck
2nd row TerrifyingAsFuck
3rd row TerrifyingAsFuck
4th row TerrifyingAsFuck
5th row TerrifyingAsFuck

Letter

Count 5174817
Lowercase Letter 4521143
Space Separator 0
Uppercase Letter 653674
Dash Punctuation 0
Decimal Number 21088
  • The top 2 categories (IsraelPalestine, worldnews) account for over 50.0% of the values
  • The largest value (israelpalestine) is over 2.46 times larger than the second largest value (worldnews)
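
The two insights above compare category frequencies. A minimal sketch of both measures follows, assuming the "times larger" figure is the ratio of the two highest counts (the report does not define it). The `dominance` helper and the toy labels are hypothetical.

```python
from collections import Counter


def dominance(values):
    """Compare the two most frequent categories; needs at least two distinct values."""
    ranked = Counter(values).most_common()
    if len(ranked) < 2:
        raise ValueError("need at least two distinct values")
    total = sum(count for _, count in ranked)
    (top, top_count), (second, second_count) = ranked[0], ranked[1]
    return {
        "largest": top,
        "second": second,
        # Assumption: ratio of the highest count to the second highest.
        "ratio": top_count / second_count,
        "top2_share": (top_count + second_count) / total,
    }


# Toy labels (hypothetical, not the subreddit distribution in this report).
toy = ["a"] * 5 + ["b"] * 2 + ["c"] * 3
result = dominance(toy)
```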

created_time

categorical

Approximate Distinct Count 362803
Approximate Unique (%) 93.7%
Missing 0
Missing (%) 0.0%
Memory Size 32511024 B

Length

Mean 19
Standard Deviation 0
Median 19
Minimum 19
Maximum 19

Sample

1st row 2023-10-07 00:51:3...
2nd row 2023-10-07 01:04:2...
3rd row 2023-10-07 01:06:4...
4th row 2023-10-07 01:32:5...
5th row 2023-10-07 02:35:1...

Letter

Count 0
Lowercase Letter 0
Space Separator 387036
Uppercase Letter 0
Dash Punctuation 774072
Decimal Number 5418504
  • created_time contains many words: 84555 words
  • created_time has words of constant length
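
created_time is profiled as categorical, but its constant length of 19 and the visible sample prefixes suggest a `YYYY-MM-DD HH:MM:SS` timestamp. The sketch below parses such strings with the standard library. The format string is an assumption inferred from the samples, and the example timestamp is hypothetical, since the sample rows in this report are truncated.

```python
from datetime import datetime

# Assumption: 19-character timestamps of the form "YYYY-MM-DD HH:MM:SS".
ASSUMED_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_created_time(values):
    """Parse timestamp strings; raises ValueError on any value that does not match."""
    return [datetime.strptime(raw, ASSUMED_FORMAT) for raw in values]


# Hypothetical value for illustration only (not a row from the dataset).
parsed = parse_created_time(["2023-10-07 00:00:00"])
```

Once parsed, the column can be treated as a time axis rather than as about 360,000 distinct categories.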

score

numerical

Approximate Distinct Count 1881
Approximate Unique (%) 0.5%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6192576 B
Mean 19.3619
Minimum -980
Maximum 9688
Zeros 19038
Zeros (%) 4.9%
Negatives 31894
Negatives (%) 8.2%
  • score is right-skewed (skewness γ1 = 25.1156)

Quantile Statistics

Minimum -980
5th Percentile -3
Q1 1
Median 2
Q3 8
95th Percentile 74
Maximum 9688
Range 10668
IQR 7
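
The quantile table can be reproduced along the following lines. This is a sketch using `statistics.quantiles` with linear interpolation (`method="inclusive"`); the interpolation rule actually used by the report is not stated, so treat that choice as an assumption. `quantile_stats` is a hypothetical name.

```python
import statistics


def quantile_stats(data):
    """Return the quantile summary for a sequence of at least two numbers."""
    # Assumption: linear interpolation between order statistics.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    twentieths = statistics.quantiles(data, n=20, method="inclusive")
    return {
        "min": min(data),
        "p5": twentieths[0],    # first of 19 cut points = 5th percentile
        "q1": q1,
        "median": median,
        "q3": q3,
        "p95": twentieths[-1],  # last of 19 cut points = 95th percentile
        "max": max(data),
        "range": max(data) - min(data),
        "iqr": q3 - q1,
    }


# Toy data (hypothetical): the integers 0 through 100.
qs = quantile_stats(list(range(101)))
```

The derived rows in the table follow from the others: Range = 9688 - (-980) = 10668 and IQR = 8 - 1 = 7.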

Descriptive Statistics

Mean 19.3619
Standard Deviation 124.4824
Variance 15495.8587
Sum 7.4938e+06
Skewness 25.1156
Kurtosis 1009.6446
Coefficient of Variation 6.4292
  • score is not normally distributed (p-value 4.45e-25)
  • score has 64611 outliers
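
The shape statistics and the outlier count can be sketched as follows. Three choices here are assumptions, because the report does not document its formulas: population moments for skewness, excess kurtosis (normal distribution = 0), and the 1.5 × IQR rule for outliers. Both helper names are hypothetical.

```python
import statistics


def moment_stats(data):
    """Skewness, excess kurtosis and coefficient of variation (needs >= 2 values)."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return {
        "skewness": m3 / m2 ** 1.5,            # assumption: population moments
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # assumption: excess kurtosis
        "cv": statistics.stdev(data) / mean if mean else float("nan"),
    }


def iqr_outliers(data, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; the rule itself is an assumption."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


# Toy data (hypothetical).
shape = moment_stats([1, 2, 3, 4, 5])
outliers = iqr_outliers([1, 2, 3, 4, 5, 100])
```

With the report's own Q1 = 1 and Q3 = 8, the 1.5 × IQR fences would sit at -9.5 and 18.5; whether the 64611 outliers were counted with this rule is not stated. The coefficient of variation does follow from the table: 124.4824 / 19.3619 ≈ 6.429.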

controversiality

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 25544376 B
  • The largest value (0) is over 15.57 times larger than the second largest value (1)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 0
2nd row 0
3rd row 0
4th row 1
5th row 0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 387036
  • The top 2 categories (0, 1) account for over 50.0% of the values
  • controversiality has words of constant length

Interactions

Correlations

Missing Values